96 ◾ Bioinformatics
• A file ending in “*stats.csv” or “*stats.tab” holds the contig/scaffold contiguity
statistics in comma-separated (CSV) or tab-delimited format, respectively. The assembly
statistics generated by ABySS are shown in Figure 3.6 and described in Table 3.1.
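As a minimal sketch of how such a statistics file could be inspected programmatically, the snippet below parses delimited text with Python's standard csv module. The function name parse_stats is hypothetical, and the sample row is assembled from the values quoted in the text under the column names of Table 3.1; it is an illustration, not the actual contents of the ABySS output file, whose exact header may differ.

```python
import csv
import io


def parse_stats(lines, delimiter=","):
    """Parse delimited assembly statistics into a list of dicts.

    `lines` is any iterable of text lines (an open file or StringIO).
    Use delimiter="\t" for a tab-delimited "*stats.tab" file.
    """
    return list(csv.DictReader(lines, delimiter=delimiter))


if __name__ == "__main__":
    # Illustrative sample only: column names follow Table 3.1 and the
    # values are those quoted in the text, not a real ABySS file.
    sample = io.StringIO(
        "N,n:500,L50,Min,N50,Max,Name\n"
        "836,107,15,584,112320,267586,ecoli-scaffolds.fa\n"
    )
    for row in parse_stats(sample):
        print(row["Name"], "N50 =", row["N50"], "L50 =", row["L50"])
```

For a real file, an open file handle can be passed instead of the StringIO object, for example `parse_stats(open(path, newline=""))`.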
The contig/scaffold N50 is the most widely used metric for describing the quality
of a genome assembly. It is calculated by first ordering every contig/scaffold by length
from the longest to the shortest. Next, the lengths are summed starting from the longest
contig/scaffold until the running sum reaches or exceeds one-half of the total length of
all contigs/scaffolds in the assembly. The N50 of an assembly is the length (bp) of the
shortest contig/scaffold among these longest sequences that together form 50% of the
assembly, and the L50 is the number of sequences in that set. When comparing
assemblies, the larger the N50 and the smaller the L50, the better the assembly.
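The calculation described above can be sketched in a few lines of Python. This is an illustrative implementation under the stated definition, not the code ABySS itself uses; the function name nx_lx and the toy lengths are made up for the example.

```python
def nx_lx(lengths, fraction=0.5):
    """Return (Nx, Lx) for a list of contig/scaffold lengths in bp.

    fraction=0.5 gives N50/L50; 0.25 and 0.75 give N25/L25 and N75/L75.
    """
    if not lengths:
        raise ValueError("no sequence lengths given")
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")

    ordered = sorted(lengths, reverse=True)  # longest first
    threshold = fraction * sum(ordered)      # e.g. half the assembly length

    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= threshold:
            # `length` is the shortest sequence in the set reaching the
            # threshold (Nx); `count` is the size of that set (Lx).
            return length, count


if __name__ == "__main__":
    toy = [100, 80, 60, 40, 20]  # made-up lengths, total 300 bp
    print(nx_lx(toy))            # (80, 2): 100 + 80 = 180 >= 150
```

With the toy lengths, the two longest sequences already cover half of the 300 bp total, so the N50 is 80 bp and the L50 is 2.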
In Figure 3.6, the scaffolds file (ecoli-scaffolds.fa) contains 836 sequences, of which
107 are at least 500 bp long. Among these, the shortest sequence is 584 bp and the longest
is 267,586 bp. The N50 is 112,320 bp and the L50 (the number of scaffolds that account
for at least 50% of the assembly) is 15.
Figure 3.7 shows a diagram explaining the major metrics of a genome assembly
(N25=55, N50=70, N75=75, L25=4, L50=6, and L75=7) and how they can be computed. In
the figure, there are eight contigs ranked from the smallest to the largest. The total number
of bases is 445 Mb (100%), and half of that total is 222.5 Mb (50%).
You can display both the contigs file and the scaffolds file in a Linux terminal using
the “less” command as:
TABLE 3.1 Assembly Statistics
Column    Description
N         The total number of sequences in the FASTA file
n:500     The number of sequences whose lengths are not less than 500 bp
L50       The number of scaffolds that account for at least 50% of the total assembly length
LG50      The number of scaffolds that account for at least 50% of the total genome length
NG50      The sequence length of the shortest contig at 50% of the total genome length
Min       The size of the smallest sequence
N75       The sequence length of the shortest contig at 75% of the total assembly length
N50       The sequence length of the shortest contig at 50% of the total assembly length
N25       The sequence length of the shortest contig at 25% of the total assembly length
E-size    The sum of the squares of the sequence sizes divided by the assembly size
Max       The size of the largest sequence
Sum       The sum of the sequence sizes
Name      The file name of the assembly
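Two of the entries in Table 3.1, E-size and NG50/LG50, can be illustrated with a short Python sketch that follows the table's descriptions. The function names, the toy lengths, and the genome size are assumptions made for the example; ABySS may apply its own length cutoff before computing these values.

```python
def e_size(lengths):
    """Sum of squared sequence sizes divided by the assembly size."""
    total = sum(lengths)
    if total == 0:
        raise ValueError("assembly has no bases")
    return sum(length * length for length in lengths) / total


def ng50_lg50(lengths, genome_size):
    """Return (NG50, LG50), using the genome size instead of the assembly size.

    Returns None when the assembly covers less than half of the genome,
    in which case NG50 is undefined.
    """
    half_genome = genome_size / 2
    running = 0
    for count, length in enumerate(sorted(lengths, reverse=True), start=1):
        running += length
        if running >= half_genome:
            return length, count
    return None


if __name__ == "__main__":
    toy = [100, 80, 60, 40, 20]                 # made-up lengths, total 300 bp
    print(round(e_size(toy), 2))                # 22000 / 300 = 73.33
    print(ng50_lg50(toy, genome_size=400))      # (60, 3): 240 >= 200
```

Unlike N50, NG50 depends on an externally supplied genome size, so it stays comparable between assemblies of the same genome even when their total lengths differ.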
FIGURE 3.6 Assembly statistics.